Using Spark with compressed data

Run this Notebook on GCP with Dataproc
Spark can automatically decompress data in a variety of formats

Set up COS functions for GCS

The file is compressed, not a plain CSV

Enable repl.eagerEval

This outputs DataFrame results at each step without the need to call df.show(), and it also improves the formatting of the output.
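Eager evaluation is controlled by the `spark.sql.repl.eagerEval.enabled` configuration; a minimal sketch, assuming `spark` is the notebook's SparkSession:

```python
# Enable eager evaluation so DataFrames render as HTML tables in the
# notebook without an explicit df.show() call.
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)

# Optionally cap how many rows are rendered (the default is 20).
spark.conf.set("spark.sql.repl.eagerEval.maxNumRows", 20)
```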

Which product category gets the most reviews?

Which product category gets the highest / lowest review star ratings?

Do you see seasonality in reviews? Yes.

Is every product category getting a consistent number of reviews throughout the year? No. Gift Card, for example, receives more reviews during the holiday season around December.


Are any product categories getting more reviews during certain times of the year? If yes, which categories, and at what times of the year?

Apparel: December

Automotive: August

Baby: December

Beauty: December

Books: December

Camera: December

Digital_ebook_purchase: March, December

Digital_music_purchase: March, December

Digital_software: March

Digital_video_download: August, December

Digital_video_games: March, December

Electronics: December

Furniture: March, December

Gift card: December

Grocery: March, December

Health & personal care: December

Home: December

Home entertainment: December

Home improvement: December

Jewelry: December

Kitchen: December

Lawn and garden: July

Luggage: December

Major: December

Mobile apps: March, December

Mobile electronics: December

Music: December

Musical instruments: December

Office products: December

Outdoors: August, December

PC: December

Personal care appliances: December

Pet products: December

Shoes: December

Software: December

Sports: December

Tools: December

Toys: December

Video: December

Video DVD: December

Video games: December

Watches: December

Wireless: December

For the DataFrame reviews_pd, draw one line chart per unique value in the "product_category" column (there are 43 unique categories), with month on the x-axis and one line per year, so the result is 43 charts.
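The name `reviews_pd` suggests a pandas DataFrame, so the charts can be drawn with matplotlib. A sketch under the assumption that `reviews_pd` has `product_category`, `year`, and `month` columns with one row per review (the function name and `out_prefix` parameter are illustrative):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this inside a notebook
import matplotlib.pyplot as plt


def plot_monthly_reviews(reviews_pd: pd.DataFrame, out_prefix: str = "reviews"):
    """One chart per product_category: x = month, one line per year."""
    # Count reviews per (category, year, month); one input row = one review.
    monthly = (
        reviews_pd.groupby(["product_category", "year", "month"])
        .size()
        .rename("n_reviews")
        .reset_index()
    )
    for category, grp in monthly.groupby("product_category"):
        fig, ax = plt.subplots()
        for year, yr in grp.groupby("year"):
            yr = yr.sort_values("month")
            ax.plot(yr["month"], yr["n_reviews"], marker="o", label=str(year))
        ax.set_title(category)
        ax.set_xlabel("month")
        ax.set_ylabel("reviews")
        ax.legend(title="year")
        fig.savefig(f"{out_prefix}_{category}.png")
        plt.close(fig)
```

With 43 categories this writes 43 PNG files, one per category.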

Which reviews get the highest ratio of helpful votes to total votes?

Any correlation between the length of review headline (in number words) and the "helpfulness" of the review?

The correlation coefficient of approximately 0.031 between review headline length (in words) and helpful votes indicates a very weak positive correlation: reviews with longer headlines tend to receive slightly more helpful votes, but the relationship is negligible.
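A sketch of how such a coefficient can be computed in pandas, assuming `review_headline` and `helpful_votes` columns as in the dataset's schema (this is one plausible method, not necessarily the one used to produce the 0.031 figure):

```python
import pandas as pd


def headline_helpfulness_corr(reviews_pd: pd.DataFrame) -> float:
    """Pearson correlation between headline length in words and helpful votes."""
    # Word count of each headline; missing headlines count as 0 words.
    words = reviews_pd["review_headline"].fillna("").str.split().str.len()
    return words.corr(reviews_pd["helpful_votes"])
```

The same pattern applies to the review body: swap in the `review_body` column.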

Any correlation between the length of review body (in number words) and the "helpfulness" of the review?

The correlation coefficient of approximately 0.138 between review body length (in words) and helpful votes indicates a weak positive correlation: reviews with longer bodies tend to receive slightly more helpful votes.

Do you see any correlation between how many reviews a certain customer (customer_id) published and the "helpfulness" of the reviews?

The correlation coefficient of approximately 0.008 between the number of reviews a customer has published and the helpful votes those reviews receive suggests essentially no correlation: prolific reviewers are not noticeably more (or less) helpful.
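This one needs a per-customer review count joined back onto each review before correlating. A sketch assuming `customer_id` and `helpful_votes` columns (one plausible method, not necessarily the one behind the 0.008 figure):

```python
import pandas as pd


def reviewer_count_helpfulness_corr(reviews_pd: pd.DataFrame) -> float:
    """Correlation between a customer's total review count and the
    helpful votes each of their reviews receives."""
    # transform("size") attaches each customer's review count to every
    # one of that customer's rows, aligned with helpful_votes.
    counts = reviews_pd.groupby("customer_id")["helpful_votes"].transform("size")
    return counts.corr(reviews_pd["helpful_votes"])
```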

Explain how you accomplish this computational effectiveness based on your knowledge of Spark

In Spark, computational effectiveness is achieved through distributed computing across a cluster of machines. Spark efficiently distributes data processing tasks across the cluster and optimizes the execution of these tasks through various mechanisms. Here's how Spark achieves computational effectiveness:

Distributed Data Processing: Spark distributes data across the cluster in partitions, allowing parallel processing of data. Each partition of data is processed independently on different nodes in the cluster, enabling high throughput and scalability.

In-Memory Computation: Spark utilizes in-memory computation to cache intermediate data in memory, reducing the need to read from disk repeatedly. By keeping data in memory, Spark minimizes the latency associated with accessing data from disk, resulting in faster processing times.

Lazy Evaluation: Spark employs lazy evaluation, meaning that transformations on the data are not executed immediately. Instead, Spark builds up a directed acyclic graph (DAG) representing the sequence of transformations to be applied to the data. This allows Spark to optimize the execution plan by combining and reordering transformations before executing them, improving performance.

Task Pipelining: Spark pipelines tasks together to minimize data shuffling and reduce overhead. Tasks that can be executed together are grouped into stages, and stages are executed one after another with minimal data movement between them, reducing communication overhead and improving efficiency.

Partitioning and Data Locality: Spark optimizes data partitioning and scheduling to ensure that data processing tasks are executed on nodes where the data resides (data locality). This minimizes data movement across the network and maximizes the utilization of available resources, improving computational efficiency.

Distributed Data Structures and Operations: Spark provides distributed data structures such as RDDs (Resilient Distributed Datasets) and DataFrames, along with a rich set of parallelized operations (transformations and actions) to process data efficiently in parallel across the cluster.

Optimized Shuffle Operations: Spark optimizes wide operations such as groupBy, join, and sortBy to minimize data shuffling and network traffic, for example by combining values map-side (partial aggregation before the shuffle) and by using broadcast joins when one side is small enough to avoid a shuffle entirely.

By leveraging these techniques, Spark achieves computational effectiveness, enabling scalable, high-performance data processing for a wide range of applications.